ABSTRACT


Understanding large-scale patterns in student course enroll- 
ment is a problem of great interest to university administra- 
tors and educational researchers. Yet important decisions 
are often made without a good quantitative framework of 
the process underlying student choices. We propose a prob- 
abilistic approach to modelling course enrollment decisions, 
drawing inspiration from multilabel classification and mix- 
ture models. We use ten years of anonymized student tran- 
scripts from a large university to construct a Gaussian latent 
variable model that learns the joint distribution over course 
enrollments. The models allow for a diverse set of inference 
queries and robustness to data sparsity. We demonstrate the 
efficacy of this approach in comparison to others, including 
deep learning architectures, and demonstrate its ability to 
infer the underlying student interests that guide enrollment 
decisions. 


1. INTRODUCTION 


Education researchers increasingly recognize the need to un- 
derstand the sequential accumulation of college coursework 
into academic pathways. In [2, 23], Bailey et al. call for 
change in how colleges organize course offerings to enable 
more efficient pathways. Rather than presenting a bewilder- 
ing array of courses, cafeteria-style, they recommend “guided 
pathways” through academic offerings. Baker [3] builds on 
Bailey’s work, suggesting “meta-majors” for simplifying choice 
without curtailment of options. Meta-majors entail com- 
bining coursework supporting multiple majors into larger, 
substantively coherent content domains. Baker proposes 
social-network analytic techniques to discover opportuni- 
ties for building meta-majors. All of these authors argue 
that rather than limiting choice, such interventions can yield 
more tractable programs, faster degree completion, and lower 
cost for both students and schools. 


Such reforms can be enabled by analysis of data corpora
describing the academic sequences of prior student enroll-
ments. For example, some courses may be de facto pre-
requisites for other courses, whether listed as “required” or
not in formal catalogue entries. Similarly, “odd” delays in 
taking particular courses, or unexplained detours in course 
selection, can be symptoms of unintended scheduling con- 
flicts. 


In the service of such reforms, we offer a model of course en- 
rollment capable of efficient inference over hundreds to thou- 
sands of classes. Our generative model captures the full joint 
distribution of course enrollments and can be used to sample 
potential pathways for any given student. The model’s com-
plexity allows us to determine an underlying “typology”
of students, from implicit course-taking patterns to differing 
levels of novelty in their academic pathways relative to the 
overall population of paths. 


2. BACKGROUND & MODELS 


Predicting course enrollment decisions may be viewed as a 
problem of multi-label classification: the task of assigning a 
subset of labels to each data point in a collection. In the con-
text of academic course enrollments, each data point is a
student and the labels are the courses in which that student
enrolled. The problem
of modeling all possible enrollment choices scales exponen-
tially with the number of classes $m$ ($O(2^m)$), which motivates a
statistical approach. Probabilistic graphical models (PGMs) 
and deep neural networks are perhaps the most prominent 
methods for stochastic models of high-dimensional data. As 
our motivation in this work is not simply high accuracy but 
also interpretability and inference, we focus on PGMs, which 
fare better on those aspects and are amenable to scaling ad- 
equate for our empirical setting. 


2.1 Latent Variable Models 


Latent variable models are a subclass of PGMs in which 
some variables are never observed in training data and are 
thus “latent.” These models are more computationally de- 
manding than fully observed models, but also are able to 
capture complex structure in data without supervision. 


2.1.1 Models of Conditional Independence 

Among the simplest and most commonly used latent vari- 
able models is the naive Bayes model with hidden variable 
$H$ taking discrete values $h_k$ and observations $X$. In the en-
rollment setting, $X = [x^0, \ldots, x^n]$ and $x^i = [x^i_0, \ldots, x^i_m]$ with
$x^i_j \in \{0, 1\}$ (0 denoting no enrollment and 1 an enrollment).
The generative process of the model is described below:


$h^i \sim \mathrm{Multinomial}(\theta)$
$x^i_j \mid h^i \sim \mathrm{Multinomial}(\phi_j)$


Figure 1: Top: A stacked area plot of enrolled courses per university sub-school per year.
Bottom: A histogram of the number of courses taken by each individual student with a Gaussian fit.


Given the value of the hidden variable, each individual prob- 
ability of enrolling in a course is independent. This is a poor 
inductive bias because enrollment decisions often influence 
one another, and the number of courses taken by a student 
in a given year is dependent on the courses taken. It is easier 
to capture these two facets of the data if we can model en- 
rollments jointly, without strong independence assumptions. 
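
The conditional independence assumption can be made concrete with a small simulation. The following is a minimal sketch, not the implementation used in this work: the parameter values `theta` and `phi` are hypothetical, and each per-course draw is written as a Bernoulli trial (a two-outcome multinomial).

```python
import random

def sample_naive_bayes(theta, phi, rng):
    """Draw one student: a hidden type h, then each course independently given h.

    theta[k]  : probability of hidden type k (hypothetical values)
    phi[k][j] : probability that a type-k student enrolls in course j
    """
    h = rng.choices(range(len(theta)), weights=theta, k=1)[0]
    x = [1 if rng.random() < p else 0 for p in phi[h]]
    return h, x

rng = random.Random(0)
theta = [0.6, 0.4]
phi = [[0.9, 0.8, 0.1],   # type 0 favours courses 0 and 1
       [0.1, 0.2, 0.9]]   # type 1 favours course 2
students = [sample_naive_bayes(theta, phi, rng) for _ in range(5000)]

# Within one hidden type, enrollments in courses 0 and 1 are independent,
# so the joint frequency should be close to the product of the marginals.
type0 = [x for h, x in students if h == 0]
p0 = sum(x[0] for x in type0) / len(type0)
p1 = sum(x[1] for x in type0) / len(type0)
p01 = sum(x[0] * x[1] for x in type0) / len(type0)
print(round(p01, 3), round(p0 * p1, 3))
```

Within a single hidden type the two printed numbers should nearly coincide, which is exactly the restriction that the joint models in the following sections relax.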


2.1.2 Gaussian Mixture Model 


Any joint probability distribution over all discrete combina-
tions of $x^i_j \in \{0, 1\}$ requires $2^m - 1$ parameters and is thus
intractable. One possible solution is relaxation of the dis-
crete problem to a real-valued vector space with $X^i \in \mathbb{R}^m$
and

$x^i_j = \begin{cases} 1 & \text{if } X^i_j > 0 \\ 0 & \text{else} \end{cases}$


By training a model over X, we can take advantage of real- 
valued distributions with much smaller parameter spaces. 


The Gaussian Mixture Model (GMM) is an archetypal latent 
variable model for real-valued data [22]. We can describe a 
GMM by the generative process below:


$h^i \sim \mathrm{Multinomial}(\theta)$
$X^i \mid h^i \sim \mathcal{N}(\mu_{h^i}, \Sigma_{h^i})$


We can modify the GMM for the setting of multi-label clas- 
sification by providing an unbiased estimator of the proba- 
bility of each binary sample: 


$P(x = [1, 0, \ldots, 1]) = P(X_0 > 0, X_1 < 0, \ldots, X_m > 0)$


= Fe eT aly’), 2@'))tly' > T) 


where $y'$ are samples drawn from a multivariate normal over
a subset of the variables in $x$ and $\mu(y')$, $\Sigma(y')$ are the pa-
rameters of a multivariate normal conditioned on the value
of y’. More detail on this estimator is provided in the online 
posting of this paper. 
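
To illustrate what such an estimator computes, the sketch below estimates the probability of a binary enrollment pattern under a single Gaussian component by plain Monte Carlo sampling. This is a deliberately naive stand-in, not the conditional estimator described above: it simply counts how often a sampled real-valued vector falls in the region that thresholds to the requested pattern. The helper names and parameter values are hypothetical.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T == a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def sample_mvn(mu, low, rng):
    """Draw X ~ N(mu, L L^T) from independent standard normals."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(low[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]

def binary_pattern_prob(pattern, mu, sigma, rng, n_samples=20000):
    """Naive Monte Carlo estimate of P(x = pattern), where x_j = 1 iff X_j > 0."""
    low = cholesky(sigma)
    hits = 0
    for _ in range(n_samples):
        x_real = sample_mvn(mu, low, rng)
        if all((v > 0) == bool(b) for v, b in zip(x_real, pattern)):
            hits += 1
    return hits / n_samples

# Two positively correlated courses (hypothetical parameters).
rng = random.Random(0)
mu = [0.0, 0.0]
sigma = [[1.0, 0.6], [0.6, 1.0]]
p_both = binary_pattern_prob([1, 1], mu, sigma, rng)
print(p_both)
```

For a zero-mean bivariate normal with correlation $\rho$, the probability that both coordinates are positive is $1/4 + \arcsin(\rho)/(2\pi)$, which gives a closed-form check on the estimate in this two-course case. Plain sampling of this kind becomes inefficient as the number of courses grows, because any single pattern is hit only rarely.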


At face value, it might seem odd to model enrollments as 
Gaussian-distributed. We choose this particular model both 
because it makes our real-valued relaxation tractable and be- 
cause we think it is reasonable to assume enrollments within 
each cluster will be fairly unimodal and smooth, especially 
as sample sizes increase. 


2.1.3 Contextual Mixture Model 

Hidden Markov Models (HMMs) are a common extension of 
stationary mixture models to sequential data [21]. In these 
models, the single latent variable is replaced with a Markov 
chain of hidden states. This model is naturally recursive, a 
property that is extremely useful when modeling processes 
that are positive recurrent. However, as enrollments often 
exhibit a strict order and returning to previous states is un- 
likely, we prefer a model that is strictly time-dependent or, 
as we will call it here, “contextual.” In general any Contex- 
tual Mixture Model (CMM) can be expressed using a Hid- 
den Markov model, but enforcing this structure allows us to 
incorporate priors that significantly improve the chances of 
training a plausible model. 


For a CMM with Gaussian emission probabilities, we have

$h^0 \sim \mathrm{Multinomial}(\theta)$
$h^{t+1} \mid h^t \sim \mathrm{Multinomial}(\phi^t_{h^t})$
$X^t \mid h^t \sim \mathcal{N}(\mu^t_{h^t}, \Sigma^t_{h^t})$


Note that the parameters of the transition and emission dis- 
tributions are different for each timestep. Figure 2 shows a 
diagram of our proposed model in plate notation. 
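
Sampling a pathway from a model of this form can be sketched as follows. This is an illustrative sketch under simplifying assumptions: covariances are taken to be diagonal, real-valued emissions are thresholded at zero to obtain binary enrollments, and all parameter values are hypothetical.

```python
import random

def sample_pathway(theta, transitions, means, stds, rng):
    """Sample one multi-year pathway from a contextual mixture model.

    theta             : initial distribution over hidden states
    transitions[t][a] : distribution of h^{t+1} given h^t = a (one table per timestep)
    means[t][a], stds[t][a] : per-course Gaussian emission parameters at timestep t
    Diagonal covariances are a simplifying assumption of this sketch.
    """
    states, enrollments = [], []
    h = rng.choices(range(len(theta)), weights=theta, k=1)[0]
    for t in range(len(means)):
        x_real = [rng.gauss(m, s) for m, s in zip(means[t][h], stds[t][h])]
        states.append(h)
        enrollments.append([1 if v > 0 else 0 for v in x_real])
        if t < len(transitions):
            row = transitions[t][h]
            h = rng.choices(range(len(row)), weights=row, k=1)[0]
    return states, enrollments

# Three timesteps, two hidden states, two courses (hypothetical parameters).
theta = [0.5, 0.5]
transitions = [
    [[0.0, 1.0], [1.0, 0.0]],   # year 1 -> year 2: states always swap
    [[0.9, 0.1], [0.2, 0.8]],   # year 2 -> year 3: states mostly persist
]
means = [
    [[2.0, -2.0], [-2.0, 2.0]],
    [[2.0, 2.0], [-2.0, -2.0]],
    [[-2.0, 2.0], [2.0, -2.0]],
]
stds = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]

states, enrollments = sample_pathway(theta, transitions, means, stds, random.Random(1))
print(states, enrollments)
```

Because each timestep has its own transition table and its own emission parameters, the same hidden state index can mean something different in each year, which is what distinguishes this structure from a stationary HMM.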


The small, discrete latent space of our model offers highly 
interpretable representations compared with the continuous 
latent vector space of neural architectures (see Fig. 4) and 
inference is highly efficient as the model has low tree-width.

Figure 2: Graphical representation of the contextual
mixture model using plate notation.


Our discrete probability estimator also allows modeling of 
all courses jointly, which is essential in this setting. 


2.1.4 Parameter Learning 

There are two primary methods of learning the parameters 
of the model we propose: expectation-maximization (EM) 
[8, 5] and gradient descent [6] on the log-likelihood objec- 
tive. The approximate probability estimate of the model 
also creates the possibility for differing levels of precision in 
accordance with the amount of computation one is willing to 
invest. For a highly biased learning process, one can simply 
train the model on data shifted from [0, 1] to [−1, 1] using
the exact probability estimate. For a less biased learning
process, one can use the probability estimator described in
Section 2.1.2 and an unbiased estimator of the gradient with
respect to the model parameters. For more details, see the 
online posting of the paper. 
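
The simpler, highly biased option can be sketched with a few lines of EM. The example below is an assumption-laden illustration rather than the training code used here: it fits a stationary Gaussian mixture with diagonal covariances to synthetic binary data shifted to −1/+1, using the exact Gaussian density. The variance floor `var_floor`, the initial parameters, and the synthetic data are all hypothetical choices.

```python
import math
import random

def shift(x):
    """Map a 0/1 enrollment vector to -1/+1 values."""
    return [2 * v - 1 for v in x]

def log_density(x, mu, var):
    """Exact log-density of x under a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def em_step(data, weights, mus, variances, var_floor=0.05):
    """One EM update for a Gaussian mixture; var_floor keeps variances positive."""
    k, d, n = len(weights), len(data[0]), len(data)
    resp = []
    for x in data:  # E-step: posterior probability of each component
        logs = [math.log(weights[c]) + log_density(x, mus[c], variances[c])
                for c in range(k)]
        top = max(logs)
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    new_w, new_mu, new_var = [], [], []
    for c in range(k):  # M-step: responsibility-weighted moments
        nc = sum(r[c] for r in resp)
        mu = [sum(r[c] * x[j] for r, x in zip(resp, data)) / nc for j in range(d)]
        var = [max(var_floor,
                   sum(r[c] * (x[j] - mu[j]) ** 2 for r, x in zip(resp, data)) / nc)
               for j in range(d)]
        new_w.append(nc / n)
        new_mu.append(mu)
        new_var.append(var)
    return new_w, new_mu, new_var

# Synthetic students (hypothetical): two groups with opposite course preferences.
rng = random.Random(0)

def student(probs):
    return [1 if rng.random() < p else 0 for p in probs]

raw = ([student([0.9, 0.9, 0.1, 0.1]) for _ in range(200)]
       + [student([0.1, 0.1, 0.9, 0.9]) for _ in range(200)])
data = [shift(x) for x in raw]

weights = [0.5, 0.5]
mus = [[0.5, 0.5, -0.5, -0.5], [-0.5, -0.5, 0.5, 0.5]]
variances = [[1.0] * 4, [1.0] * 4]
for _ in range(20):
    weights, mus, variances = em_step(data, weights, mus, variances)
print([round(m, 2) for m in mus[0]], [round(m, 2) for m in mus[1]])
```

The learned means should separate the two synthetic groups. The bias arises because the Gaussian density is evaluated at the shifted points themselves rather than integrated over the region that thresholds to each binary pattern.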


2.2 Baseline Models 


In Section 4 we present a comparison of our model with 
three baseline models. The first of these is a naive Bayes
model with the strongest model assumptions. The second
is tree-augmented naive Bayes [9], which adds dependencies
between variables to better model the joint density. Both
models were trained with EM.


The last model we use for comparison is a Variational Au-
toencoder (VAE), a deep generative model [15]. We use a
simple VAE with two fully connected layers in the encoder 
and decoder, trained on binary cross-entropy loss. 


It is important to note that while the VAE offers a good 
comparison point, the type of conditional inference (over 
sets of courses) that we describe for our Gaussian relaxation 
is not tractable in a standard VAE framework. In fact,
VAE models can suffer from suboptimal inference in general
when there is overfitting of the decoder network [7]. This
issue is particularly concerning in this setting with a relatively
small sample size.


3. EXPERIMENTS 


In this section, we describe the data used to train the models
presented in Section 2 and how we evaluated them during
training.


3.1 Data 


We use eighteen years of course enrollment data from a large 
private university in the United States. The data comprise 
approximately 30,000 student enrollment records with fields
for course name and student major. We removed part-time 
and summer students from the dataset, limiting the analyses 
presented here to full-time academic-year enrollments only. 


Figure 1 shows two basic visualizations of the data after pre- 
processing. There are at least two notable takeaways from 
these plots. First, the proportion of enrollments in each 
academic division within the university remains relatively 
stable through most of the time period represented in the 
dataset. We use this fact to aggregate over time without 
explicitly modeling changes in enrollment patterns. Second, 
the fact that the number of courses taken is approximately 
Gaussian-distributed shows that enrollment patterns are not 
intensely multi-modal; thus the assumptions of our probabilistic
model are plausible.


In what follows we replace full course names with abbrevi- 
ated proxies to enable universal legibility. For example, CS1 
corresponds to the introductory computer science class and 
“Alg” or “AI” correspond to algorithms or artificial intelli- 
gence classes respectively. 


4. EVALUATION 
4.1 Mean-Field Evaluation 


Though we can compare many of the models under con- 
sideration with log-likelihood alone, some only offer an ap- 
proximate lower bound (VAEs). Thus we provide another 
evaluation metric that can be used to compare any model 
that can generate sampled enrollments. 


For this loss function, we compare the empirical enrollment
distributions in samples from our model and the distribu-
tions of the hold-outs. Let $p^*_j$ be the probability that class $j$
is taken by any given student in the hold-out data, and $p'_j$
be the corresponding probability in the samples. We take
as our error $E(p', p^*)$ with


$E(p', p^*) = \sum_j (p'_j - p^*_j)$
which approximates the distance between the two true mul-
tivariate distributions (the distribution of our model and the
distribution of the data) if all the variables were independent
(mean field approximation).
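
A comparison of this kind needs only per-course enrollment frequencies, so it can be computed for any model that produces samples. The sketch below is a minimal illustration with hypothetical data. The per-course differences are squared before summing so that deviations of opposite sign cannot cancel; that aggregation is an assumption of this sketch and can be changed through the `power` argument.

```python
def marginals(samples):
    """Per-course enrollment frequency across a list of 0/1 enrollment vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def mean_field_error(model_samples, holdout, power=2):
    """Sum over courses of |p'_j - p*_j| ** power (power=2 is an assumption)."""
    p_model = marginals(model_samples)
    p_star = marginals(holdout)
    return sum(abs(a - b) ** power for a, b in zip(p_model, p_star))

# Hypothetical hold-out data and model samples over two courses.
holdout = [[1, 0], [1, 1]]
model_samples = [[1, 0], [0, 0]]
error = mean_field_error(model_samples, holdout)
print(error)
```

Because only marginal frequencies enter the calculation, two models that differ in how courses co-occur but agree on each course's popularity receive the same score.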


4.2 Sample Quality 

In Figure 3 we compare the performance of our model on 
hold-out data relative to baseline models described in Sec- 
tion 2.2. We also include a direct comparison of the best 
performance for each model in Table 1. 


In Figure 3 we can see that our proposed model outperforms
the baselines across the board. It also is evident that the
VAE baseline generalizes poorly as the complexity of
the model increases. These models were trained adaptively,
according to performance on the validation set, and thus are
not simply underfitting due to increased training complexity.

Figure 3: Plots of the error for the proposed model
and other models. The parameter K is the dimen-
sion of the latent space.

Model                      | Sample Error | Inference Acc.
Deep Generative            | 1.01         | N/A
Naive Bayes                | 0.78         | 60%
Tree-Augmented Naive Bayes | 0.76         | 66%
Our Gaussian               | 0.72         | 86%

Table 1: A direct comparison of the best perfor-
mance from each model on hold-out data.


In comparing the graphical models, increased complexity— 
both in the observation model and latent space—leads to 
lower error. As the error calculation itself makes an indepen- 
dence assumption, it is not surprising that the performance 
of all three graphical models is relatively close. The true 
dominance of the model proposed here is perhaps most ev- 
ident in the inference task of Table 1, described in Section 
5.3. 


4.3 Visualizing Hidden Variables 


Beyond using the loss function defined in Section 4.1, we can 
also examine the hidden states of a trained model to vali- 
date the learning process. In particular, we can investigate 
whether the hidden space captures semantically meaning- 
ful categories. Figure 4 shows a visualization for our model 
trained on CS majors. The clusters in the grid correspond 
to required courses for three different concentrations within
the major¹, and the color shows the most likely latent state
assigned by the model to each course. As we can see, courses
within the same concentration are assigned strikingly sim-
ilar latent states by the model, suggesting that the model
captures a semantically meaningful notion of the different
concentrations in its hidden state. Therefore, if there are
unknown correlations in course enrollments (for example,
many AI and biology courses taken together), this model
could bring these patterns to the fore, allowing administra-
tors an insight into possible ways to improve their degree
concentrations.

¹These requirements were taken from the depart-
ment website: https://exploredegrees.stanford.edu/schoolofengineering/computerscience.

Figure 4: A visualization of the semantic meaning
captured by the latent space of the model. Color-
ing corresponds to hidden state, and translucency
indicates the confidence of the model.


5. APPLICATIONS 


In this section we present results from two different experi- 
ments performed with our proposed model. These applica- 
tions demonstrate only a fraction of the model’s scope, but 
show its power to provide insights. 


5.1 Quantifying Enrollment Likelihood 

One of the useful applications made simple by our genera- 
tive model is in quantifying enrollment likelihood. A model 
trained on student enrollments will approximate the distri- 
bution of the training data. Thus if we evaluate the likeli- 
hood of a new student’s enrollments given the model, we can 
get a sense of how this student differs from the training ex- 
amples. Taking this principle to its extreme, we can train a 
model for each student on every other student’s enrollments, 
allowing us to model exactly how much each particular stu- 
dent varies from the typical. 


By examining the classes taken by students who are eval- 
uated as high versus low likelihood, we see that the model 
captures at least two meaningful axes of variance. Firstly, 
it recognizes that it is rare for students to take a very di- 
verse set of courses spanning many academic subjects. This 
insight is demonstrated in Figure 6, which shows the aver- 
age coursework for each type of student. The second insight 
that the model captures is the spectrum of ambition. More 
specifically, the model places very low probability on the 
small subgroup of students that take up to 30 computer sci- 
ence classes and places high probability on taking just the 
core requirements of the degree². Atypical students take
about 20 more courses than their counterparts on average. 


²We can identify this trend by looking at the exact classes
that are most commonly taken by these students, e.g., the
core requirements.

Figure 5: Two shadings of the same Sankey diagram constructed from the CMM trained on CS undergradu-
ate enrollments. Top: A common path taken by students engaging in pre-med requirements is highlighted in
blue. Bottom: A common path for students committed to in-depth study of computer science is highlighted.

Figure 6: Pie charts representing the difference be-
tween enrollment patterns captured by the model.
Sections correspond to the average number of
courses taken in an academic subject.


5.2 Understanding Pathways 

Another capability of the model presented here is the abil- 
ity to analyze sequences of enrollments, from inferring likely 
paths between X and Y, to uncovering unspoken student 
strategies. We can display this visually using Sankey di- 
agrams in which the width of the line between adjacent 
segments is proportional to the transition probability be-
tween the corresponding hidden states in the model. Fig-
ure 5 shows this style of Sankey diagram for CS students, with
two types of paths highlighted. The first of these captures
students who were actively taking the pre-medical
requirements in their freshman and
sophomore years. These same students were subsequently
much more likely to take depth courses later and were more
likely to focus on web development or information systems
in their depth courses. We can contrast these students with
the students that are highly committed to the CS major and 
its core classes starting freshman year. These students are 
much more likely to enroll in depth classes by their sopho- 
more year and are predisposed towards the systems and AI 
concentrations within the major. 
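
The band widths in such a diagram can be computed directly from the model parameters. The following is a minimal sketch with hypothetical parameter values. It weights each transition probability by the marginal probability of its source state, so that bands leaving a rarely visited state are drawn thinner; that weighting is an assumption of this sketch.

```python
def sankey_flows(theta, transitions):
    """Return flows[t][a][b] = P(h^t = a, h^{t+1} = b) for adjacent timesteps."""
    flows, marginal = [], list(theta)
    for table in transitions:
        flow = [[marginal[a] * table[a][b] for b in range(len(table[a]))]
                for a in range(len(table))]
        flows.append(flow)
        # Marginal over the next hidden state: sum the incoming flow.
        marginal = [sum(flow[a][b] for a in range(len(flow)))
                    for b in range(len(flow[0]))]
    return flows

# Two hidden states over three timesteps (hypothetical parameters).
theta = [0.7, 0.3]
transitions = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
flows = sankey_flows(theta, transitions)
for t, flow in enumerate(flows):
    print(t, [[round(v, 3) for v in row] for row in flow])
```

Each `flows[t]` sums to one, so the values can be passed to a plotting library as relative band widths without further normalization.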


5.3 Inferring Intermediate Classes 

Another unique capability of our model is inferring the likeli- 
hood of intermediate classes. Given the classes taken fresh- 
man year and goal classes for senior year, the model can 
place a likelihood on intermediate classes. One possible use 
case for this ability is inference of soft or tacit prerequisites 
for courses. 


To test this aspect of the model, we predicted whether stu- 
dents would take each of 5 common classes in their sopho- 
more year given freshman and senior year enrollments. We 
were able to recover the correct enrollment with around 
86% probability. From this result it is clear that the model 
can learn a sensible joint distribution over multi-year enroll- 
ments. We can compare this performance with that of the 
baseline models in Table 1, noting a substantial gain. 
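
Inference of this kind reduces to a forward-backward pass over the hidden chain in which the queried timestep is left unobserved. The sketch below illustrates the computation under simplifying assumptions: emissions are represented by per-course enrollment probabilities for each hidden state (standing in for thresholded Gaussian emissions), courses are independent given the hidden state, and all parameter values are hypothetical.

```python
def bernoulli_lik(x, probs):
    """P(x | state) for independent per-course enrollment probabilities."""
    out = 1.0
    for xi, p in zip(x, probs):
        out *= p if xi else 1.0 - p
    return out

def intermediate_course_posterior(theta, transitions, emit, observed, t_query, course):
    """P(course taken at t_query | enrollments observed at the other timesteps).

    observed[t] is a 0/1 list for timesteps with data and None where unobserved.
    emit[t][a][j] is the enrollment probability of course j in state a at time t.
    """
    n_steps, n_states = len(emit), len(theta)
    lik = [[1.0 if observed[t] is None else bernoulli_lik(observed[t], emit[t][a])
            for a in range(n_states)] for t in range(n_steps)]
    # Forward pass: alpha[t][a] is proportional to P(h^t = a, observations up to t).
    alpha = [[theta[a] * lik[0][a] for a in range(n_states)]]
    for t in range(1, n_steps):
        alpha.append([lik[t][b] * sum(alpha[t - 1][a] * transitions[t - 1][a][b]
                                      for a in range(n_states))
                      for b in range(n_states)])
    # Backward pass: beta[t][a] = P(observations after t | h^t = a).
    beta = [[1.0] * n_states for _ in range(n_steps)]
    for t in range(n_steps - 2, -1, -1):
        beta[t] = [sum(transitions[t][a][b] * lik[t + 1][b] * beta[t + 1][b]
                       for b in range(n_states)) for a in range(n_states)]
    post = [alpha[t_query][a] * beta[t_query][a] for a in range(n_states)]
    total = sum(post)
    return sum(post[a] / total * emit[t_query][a][course] for a in range(n_states))

# One course, two persistent hidden states, three timesteps (hypothetical).
theta = [0.5, 0.5]
transitions = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
emit = [[[0.9], [0.1]] for _ in range(3)]
observed = [[1], None, [1]]   # first and last years observed, middle year queried
p_middle = intermediate_course_posterior(theta, transitions, emit, observed, 1, 0)
print(round(p_middle, 4))
```

Observing enrollment in both the first and the last year shifts the posterior strongly toward the state that favours the course, so the inferred probability for the middle year is well above the prior value of 0.5.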


An even more interesting use case of this inference ability, 
however, is not simply prediction of common courses, but the 
potential for improving course selection tools. Only a model 
that captures the temporal dependencies across all courses 
is capable of offering helpful insights for goal-directed course 
selection. 


6. PRIOR WORK 


Much of the prior work on enrollment modeling in the uni- 
versity setting is dedicated purely to predictive models of 
future course enrollment [13, 18, 24] and academic perfor- 
mance [16, 11]. These models are largely incapable of pro- 
ducing the kinds of insights shown here. Preliminary work 
has also seen application of clustering algorithms to enroll-
ments in the form of latent variable models like Latent Dirichlet
Allocation (LDA) [17] and recurrent neural networks [20]. 


Much of the state-of-the-art research in student decision 
modeling is now found in the study of massive open online 
courses (MOOCs). Gardner and Brooks [10] provide a thor- 
ough overview of modern models for the problem setting. 
Of note, Balakrishnan and Coetzee use a Hidden Markov 
Model (HMM) to predict attrition in MOOCs [4]. Similarly, 
Al-Shabandar et al. use Gaussian Mixture Models (GMMs) 
to cluster MOOC students at each timestep, and thus iden- 
tify clusters of students that are likely to withdraw from the 
courses [1]. Both of these models resemble ours though their 
task is prediction of simple binary outcomes. 


Work in course recommender systems is also inspiring. Kho- 
rasani et al. create a recommender based on a Markov model 
[14]. Jiang et al. use a neural-network system [12], and add 
the choice of using grade considerations to create custom 
course recommendations (also see [19]). This second model 
yields extremely compelling results, but is not capable of the 
broad range of inference queries possible with our model.


7. CONCLUSION & FUTURE WORK


We have presented a new probabilistic model that is capable 
of capturing joint relationships between course enrollments, 
while also allowing for powerful inference queries. There is, 
however, at least one important drawback to our approach: 
the strictly Markovian character of the model. Although
this assumption allows us to easily learn model parameters,
in practice the enrollments observed at one timestep will
directly impact those at the next timestep, a dependence the
model can represent only through its hidden state. On the
other hand, because of this
inductive bias our approach is effective with less training
data than, for example, a recurrent neural net, and is there-
fore more easily deployed at institutions smaller in size than
our case university.


We emphasize the potential for future work that links data 
of the sort investigated here with other rich information, 
such as demographic information describing students, and 
earned grades. Models incorporating such information could 
meaningfully identify differences between course trajectories 
of particular kinds of students, providing insights into how 
academic policies and programs might be tuned to benefit 
specific constituencies.